Abstract. To evaluate the performance of missing-link prediction, the known
data are randomly divided into two parts, the training set and the probe set. We
argue that this straightforward and standard method may lead to severe bias, since
in real biological and information networks, missing links are more likely to be links
connecting low-degree nodes. We therefore study how to uncover missing links with
low-degree nodes, namely, the links in the probe set have lower degree products than
those in a random sample. Experimental analysis on ten local similarity indices and four
disparate real networks reveals a surprising result: the Leicht-Holme-Newman
index [E. A. Leicht, P. Holme, and M. E. J. Newman, Phys. Rev. E 73, 026120
(2006)] performs the best, although it was known to be one of the worst indices when
the probe set is a random sample of all links. We further propose a parameter-dependent
index, which considerably improves the prediction accuracy. Finally, we
show the relevance of the proposed index on three real sampling methods.
1. Introduction

Many social, biological, and information systems can be well described by networks, 
where nodes represent individuals and links denote the relations or interactions between 
nodes. The study of complex networks has therefore become a common focus of many 
branches of science. A fundamental tool for network analysis is the so-called link 
prediction, which attempts to estimate the likelihood of the existence of a link between 
two nodes, based on observed links and the attributes of nodes [1, 2].

In many biological networks, such as food webs, protein-protein interaction
networks and metabolic networks, whether a link between two nodes exists must be
demonstrated by field and/or laboratory experiments, which are usually very costly.
Our knowledge of these networks is very limited: for example, 80% of the molecular
interactions in yeast cells [3] and 99.7% of those in human cells [4, 5] are still unknown. Instead
of blindly checking all possible interactions, predicting on the basis of known interactions and
focusing on the links most likely to exist can sharply reduce the experimental costs, provided
the predictions are accurate enough. Social network analysis also comes up against
the missing data problem [6, 7], where link prediction algorithms may play a role. In
addition, the data used to construct biological and social networks may contain inaccurate
information, resulting in spurious links [8, 9]. Link prediction algorithms can be applied
to identify these spurious links [10]. Besides helping to analyze networks with missing
data, link prediction algorithms can be used to predict the links that may appear
in the future of evolving networks. For example, in online social networks, very likely
but not yet existent links can be recommended as promising friendships, which can help
users find new friends and thus enhance their loyalty to the web sites. Other
applications of link prediction include the evaluation of network evolving models [11],
the classification of partially labeled networks [12], and so on (see the review article [1]
for a detailed discussion of real applications).

To evaluate the algorithmic performance, the data set is divided into two parts:
the training set is treated as known information while the probe set is used to estimate
the prediction accuracy. To our knowledge, the datasets are always divided completely
at random. This is of course the most straightforward way, and it seems a very fair
method without any statistical bias. However, this straightforward and standard
method may lead to severe bias, since in real biological and information networks,
missing links are more likely to be links connecting low-degree nodes. For example, the
known structure of the World Wide Web is just a sample, in which the hyperlinks from
popular web pages have a higher probability of being uncovered. In contrast, hyperlinks from
little-known web pages are probably lost. Indeed, common sense suggests that an interaction between
two significant proteins, a hyperlink between two well-known web pages, or a relationship
between two famous persons is less likely to be missed. Accordingly, in this
article, we study how to uncover missing links with low-degree nodes. That is to say, we
divide the data set into two parts and make the links in the probe set less popular (i.e.,
of lower degree products) than the links in the training set. Experimental analysis on
ten local similarity indices and four disparate real networks reveals a surprising result:
the Leicht-Holme-Newman (LHN) index [13] performs the best, although it was
known to be one of the worst indices when the probe set is a random sample of all links
[14]. We further propose a parameter-dependent index, which considerably improves
the prediction accuracy. Finally, we show the relevance of the proposed index on three
real sampling methods.

This article is organized as follows. In the next section, we define the
problem of link prediction, describe the standard process for evaluating prediction
accuracy, introduce the state-of-the-art local indices for node similarity, and explain how to
sample less popular links for the probe set. Experimental results for the traditional
sampling method and the proposed method are presented in Section 3. In Section 4,
we propose an improved index which performs even better than the LHN index. In
Section 5, we introduce three mainstream sampling methods and test the improved
index on the corresponding sampled networks. Finally, we summarize our results in
Section 6.

2. Problem Description 

2.1. Link Prediction: Problem and Evaluation 

Consider an undirected network $G(V, E)$, where $V$ and $E$ are the sets of nodes and links,
respectively. Multiple links and self-connections are not allowed. Denote by $U$
the universal set containing all $|V|(|V|-1)/2$ possible links, where $|V|$ denotes the number
of elements in set $V$. Then, the set of nonexistent links is $U \setminus E$, in which there are
some missing links, namely links that exist but are unknown, or links that will form in the
future. The task of link prediction is to uncover these links. Each node pair $x$ and $y$
is assigned a score $s_{xy}$ according to a given prediction algorithm. The higher the score,
the higher the probability that this link exists. The score matrix $S$ is symmetric because $G$ is
undirected. All the nonexistent links are sorted in descending order according to their
scores, and the top-ranked links are most likely to exist.

To test the algorithmic accuracy, the observed links $E$ are divided into two groups:
the training set $E^T$ is treated as known information, while the probe set $E^P$ is used
for testing and no information therein is allowed to be used for prediction. Clearly,
$E = E^T \cup E^P$ and $E^T \cap E^P = \emptyset$. The accuracy of prediction is quantified by a standard
metric called AUC (short for area under the receiver operating characteristic curve) [15].
Specifically, this metric can be interpreted as the probability that a randomly chosen
missing link (a link in $E^P$) has a higher score than a randomly chosen nonexistent link
(a link in $U \setminus E$). In the implementation, among $n$ independent comparisons, if
there are $n'$ times that the missing link has the higher score and $n''$ times that the missing link
and the nonexistent link have the same score, the AUC is calculated as

$$\mathrm{AUC} = \frac{n' + 0.5\,n''}{n}.$$

If all the scores
are generated from an independent and identical distribution, the accuracy should be
about 0.5. Therefore, the degree to which the accuracy exceeds 0.5 indicates how much
better the algorithm performs than pure chance.

Table 1: Mathematical expressions of the ten local similarity indices. Denote by $k_x$ and $\Gamma_x$
the degree of node $x$ and the set of its neighbors, respectively. See Ref. [14] for details.

CN        $s^{CN}_{xy} = |\Gamma_x \cap \Gamma_y|$
PA        $s^{PA}_{xy} = k_x \times k_y$
Salton    $s^{Salton}_{xy} = \frac{|\Gamma_x \cap \Gamma_y|}{\sqrt{k_x \times k_y}}$
Jaccard   $s^{Jaccard}_{xy} = \frac{|\Gamma_x \cap \Gamma_y|}{|\Gamma_x \cup \Gamma_y|}$
Sørensen  $s^{Sørensen}_{xy} = \frac{2|\Gamma_x \cap \Gamma_y|}{k_x + k_y}$
HPI       $s^{HPI}_{xy} = \frac{|\Gamma_x \cap \Gamma_y|}{\min\{k_x, k_y\}}$
HDI       $s^{HDI}_{xy} = \frac{|\Gamma_x \cap \Gamma_y|}{\max\{k_x, k_y\}}$
LHN       $s^{LHN}_{xy} = \frac{|\Gamma_x \cap \Gamma_y|}{k_x \times k_y}$
AA        $s^{AA}_{xy} = \sum_{z \in \Gamma_x \cap \Gamma_y} \frac{1}{\log k_z}$
RA        $s^{RA}_{xy} = \sum_{z \in \Gamma_x \cap \Gamma_y} \frac{1}{k_z}$

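
The sampled AUC estimate described in Section 2.1 can be sketched as follows. This is only an illustration under stated assumptions: the function name `auc_by_sampling`, the callable `score`, and the representation of links as pairs in Python lists are hypothetical choices, not part of the original study.

```python
import random


def auc_by_sampling(score, probe_links, nonexistent_links, n=10000, seed=0):
    """Estimate AUC as (n' + 0.5 * n'') / n from n independent comparisons.

    score             -- callable (x, y) -> similarity score (assumed interface)
    probe_links       -- list of missing links, i.e. the probe set
    nonexistent_links -- list of node pairs that are not links at all
    """
    rng = random.Random(seed)
    higher = 0  # n': the missing link has the higher score
    ties = 0    # n'': both links have the same score
    for _ in range(n):
        s_missing = score(*rng.choice(probe_links))
        s_absent = score(*rng.choice(nonexistent_links))
        if s_missing > s_absent:
            higher += 1
        elif s_missing == s_absent:
            ties += 1
    return (higher + 0.5 * ties) / n
```

With a constant score every comparison is a tie, so the estimate is exactly 0.5, which matches the pure-chance baseline mentioned in the text.
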
2.2. Similarity Index 

The simplest framework of link prediction is the similarity-based algorithm, where each
pair of nodes, $x$ and $y$, is assigned a score $s_{xy}$, which is directly defined as the similarity
between them [1, 16, 17]. All non-observed links are ranked according to their scores,
and the links connecting more similar nodes are supposed to have higher existence
likelihoods. Owing to their simplicity, similarity-based algorithms are the
mainstream of link prediction research.

In this article, we adopt the simplest local similarity indices. Zhou et al. [14] have
investigated the performance of ten local indices: Common Neighbors
(CN), Salton index [18], Jaccard index [19], Sørensen index [20], Hub Promoted index
(HPI) [21], Hub Depressed index (HDI) [14], Leicht-Holme-Newman index (LHN) [13],
Preferential Attachment (PA) [22, 23], Adamic-Adar index (AA) [24] and Resource
Allocation index (RA) [14, 25]. It was shown that the RA index performs best, and
the LHN and PA indices perform the worst. However, these results were obtained with
random probe set division. In this article, we compare the performance of these
ten indices in predicting the missing links with low-degree nodes. Their mathematical
expressions are shown in Table 1. Detailed information on these ten indices can be
found in Refs. [1, 14]. Note that the above indices, except PA, are all common-neighbor
based. Among them, the Salton index, Jaccard index, Sørensen index, HPI, HDI and LHN
differ in their denominators, which take into account the degrees of the two endpoints
of the predicted link, while the AA and RA indices consider the degrees of the common
neighbors.
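
For concreteness, the ten local indices can be computed from a neighbor-set representation of the network roughly as below. This is a sketch using the standard textbook definitions of these indices; the function name `local_indices`, the dictionary `adj` (node to set of neighbors) and the convention of returning 0 when a denominator vanishes are assumptions made for illustration.

```python
import math


def local_indices(adj, x, y):
    """Return the ten local similarity scores for the node pair (x, y).

    adj maps each node to the set of its neighbours (undirected, simple graph).
    """
    gx, gy = adj[x], adj[y]
    kx, ky = len(gx), len(gy)
    common = gx & gy
    cn = len(common)

    def ratio(num, den):
        # Assumed convention: a zero denominator yields a score of 0.
        return num / den if den else 0.0

    return {
        "CN": cn,
        "Salton": ratio(cn, math.sqrt(kx * ky)),
        "Jaccard": ratio(cn, len(gx | gy)),
        "Sorensen": ratio(2 * cn, kx + ky),
        "HPI": ratio(cn, min(kx, ky)),
        "HDI": ratio(cn, max(kx, ky)),
        "LHN": ratio(cn, kx * ky),
        "PA": kx * ky,
        # A common neighbour of two distinct nodes has degree >= 2, so log(k_z) > 0.
        "AA": sum(1.0 / math.log(len(adj[z])) for z in common),
        "RA": sum(1.0 / len(adj[z]) for z in common),
    }
```

All indices except PA share the same numerator ingredient (the common neighbors), so a single pass over `adj[x] & adj[y]` suffices for the whole table.
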
2.3. Sampling for Probe Set

Traditionally, the probe links are randomly selected from $E$, namely each link has an equal
probability of being selected into the probe set (called random sampling). In this way, the
algorithmic accuracy measured by AUC is actually the average prediction accuracy over
the probe links. However, links may differ in predictability because of their different
roles in the network. Some prediction algorithms may be good at predicting the links
connecting high-degree nodes, while others are adept at predicting the links connecting
low-degree nodes. Therefore, in order to evaluate the performance of different algorithms
on different links, the dataset should be divided with preference.

Motivated by evaluating the algorithm's performance on uncovering the links with
low-degree nodes, in this paper we propose a preferential partition method according
to the link popularity, which is defined as

$$\mathrm{Pop}(x, y) = (k_x - 1) \times (k_y - 1), \qquad (1)$$

where $k_x$ denotes the degree of node $x$. Clearly, links with high-degree endpoints have
higher popularities than those with low-degree ends. Thus for a given network, the links
whose popularities are higher than the average popularity $\langle pop \rangle$ are popular links, and
those with popularities lower than $\langle pop \rangle$ are unpopular links. The detailed partition
steps are as follows: (i) Calculate the popularity score of each observed link according
to Eq. (1) and rank these links in descending order of their popularity scores.
(ii) Uniformly divide this list from bottom to top into $D$ groups, denoted by
$E_1, E_2, \ldots, E_D$. Clearly, $E_1 \cup E_2 \cup \ldots \cup E_D = E$ and $E_i \cap E_j = \emptyset$ ($i, j = 1, 2, \ldots, D$, $i \neq j$).
The popularity of each link in $E_i$ is no higher than that of any link in $E_j$ if $i < j$. (iii) For each subset
$E_i$, we randomly choose half of the links therein to constitute the probe set, labeled
$E_i^P$. The remaining links (i.e., $E \setminus E_i^P$) then constitute the corresponding training set, labeled
$E_i^T$. Denoting by $\langle pop \rangle_i$ the average popularity of the links in probe set $E_i^P$, we have
$\langle pop \rangle_1 < \langle pop \rangle_2 < \cdots < \langle pop \rangle_D$. $E_1^P$, consisting of the most unpopular links, is called the
cold probe set in this article. We design this sampling method for the convenience of
theoretical analysis. However, this method is far from real sampling methods.
We therefore test the relevance and validity of our main results on real sampling
methods in Section 5.
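
The three partition steps above can be sketched in a few lines. This is a minimal illustration, not the authors' code: the names `popularity` and `preferential_partition`, the neighbor-set dictionary `adj`, and the choice to let the last group absorb any remainder when the number of links is not a multiple of D are all assumptions. Degrees are taken from the full observed network.

```python
import random


def popularity(adj, x, y):
    """Link popularity of Eq. (1): (k_x - 1) * (k_y - 1)."""
    return (len(adj[x]) - 1) * (len(adj[y]) - 1)


def preferential_partition(adj, links, D=10, seed=0):
    """Return D (training, probe) pairs, ordered from the coldest group to the hottest.

    adj   -- dict mapping each node to the set of its neighbours
    links -- list of observed links as (x, y) tuples
    """
    rng = random.Random(seed)
    # Ascending popularity, so that group 1 holds the least popular links.
    ranked = sorted(links, key=lambda e: popularity(adj, *e))
    size = len(ranked) // D
    splits = []
    for i in range(D):
        # Assumption: the last group absorbs the remainder of the list.
        end = (i + 1) * size if i < D - 1 else len(ranked)
        group = ranked[i * size:end]
        # Half of the group, chosen at random, becomes the probe set.
        probe = set(rng.sample(group, len(group) // 2))
        training = [e for e in links if e not in probe]
        splits.append((training, sorted(probe)))
    return splits
```

With `D=10`, each probe set holds about 5% of the observed links, and `splits[0][1]` plays the role of the cold probe set.
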

3. Experimental analysis 

3.1. Data Description 

We consider four representative networks drawn from disparate fields: (i) USAir: The
network of the US air transportation system, which contains 332 airports and 2126 airlines
[26]. (ii) NetScience: A network of coauthorships between scientists who are themselves
publishing on the topic of networks [27]. The network contains 1589 scientists, 128
of whom are isolated. Here we consider the largest component, which contains only 379
scientists. (iii) C.elegans: The neural network of the nematode worm C.elegans, in
which an edge joins two neurons if they are connected by either a synapse or a gap
junction [28]. This network contains 297 neurons and 2148 links. (iv) Political Blogs:
The network of US political blogs [29]. The original links are directed; here we treat
them as undirected links. Table 2 summarizes the basic topological features of these
networks. Brief definitions of the monitored topological measures can be found in the
table caption. For more details, see the review articles [30, 31, 32, 33, 34].

Table 2: The basic topological features of the giant components of the four example 
networks. NS, CE and PB are the abbreviations for NetScience, C.elegans and Political 
Blogs networks respectively. $N = |V|$ and $M = |E|$ are the total numbers of nodes and
links, respectively. $C$ and $r$ are the clustering coefficient [28] and the assortative coefficient [35].
$\langle k \rangle$ is the average degree of the network, and $\langle d \rangle$ is the average shortest distance between node
pairs. H denotes the degree heterogeneity defined as H = 
3.2. Results for Random Probe Set

As mentioned above, the mainstream method of preparing the probe set is random
sampling, namely all the links in $E^P$ are randomly chosen from the whole link set $E$.
In this way, AUC gives the average performance in predicting the links in the probe set.
For example, Zhou et al. compared ten local similarity indices on five real networks [14]
with this randomly selected probe set, and gave an overall evaluation measured by AUC.
Instead of obtaining a collective performance over the whole probe set, here we investigate
the algorithm's performance on each link. The accuracy of one link is defined as the
probability that this link has a higher score than one randomly chosen nonexistent
link. The dependence of four typical algorithms' accuracies on the popularity of links
is shown in Fig. 1. Note that the popularity of each link in the probe set is calculated
according to the initial dataset, not the training set. Fig. 1(a)-(l) show that the AUC
increases with increasing link popularity. This indicates that the CN, PA and
RA algorithms tend to give more accurate predictions on popular links, especially in the
USAir, PB and C.elegans networks. In comparison, the LHN index, which has been
demonstrated to be a low-accuracy method in previous works, gives more accurate
predictions on the unpopular links. The reason is that LHN tends to assign higher scores to
the unpopular links, because it uses $k_x \times k_y$ as its denominator to depress the scores of popular
links.
Figure 1: The dependence of algorithmic accuracy on the popularity of links. The
results for four typical indices, namely CN, PA, RA and LHN, are shown respectively in
four columns. Each panel is obtained by averaging over 100 implementations with
independent random partitions into training set and probe set. The probe set contains
5% of the observed links. Note that the AUC value corresponds to the average AUC of
links with the same popularity. The statistics are conducted with logarithmic binning of popularity.



We further investigate the average popularity of the top-L predicted links of
different algorithms. In principle, a link prediction algorithm provides a descending
ordered list of all non-observed links (i.e., $U \setminus E^T$ in our experiment) according to their
scores, of which we focus only on the top-L links. The top-L popularity is then
defined as the average popularity of the links among the top-L places. A low top-L average
popularity indicates that the algorithm tends to rank the missing links connecting low-degree
nodes at the top places. Table 3 shows the top-L popularity of the ten local similarity
indices on the C.elegans network. For the CN, PA, AA and RA indices the top-100 popularity
scores are extremely high, larger than 100, and the scores decrease as L
increases. This indicates that these four indices tend to rank the popular links at the top
places. For the other six indices, namely Salton, Jaccard, Sørensen, HPI, HDI and LHN,
the top-100 popularity scores are very low. In particular, the score of the LHN index is very
small, close to zero, and increases with L, suggesting that
the LHN index tends to assign higher scores to the links for which at least one of the two endpoints
has degree equal to 1. When L is large, the overlap of the
ranking lists generated by two algorithms is very high, which leads to similar top-L
popularity scores. This result further demonstrates that the LHN index is more competent
at uncovering the unpopular links, especially the links with very low-degree nodes.

Table 3: Algorithmic novelty of all local similarity indices on C.elegans, measured
by the top-L popularity. Sal, Jac and Sør are the abbreviations for the Salton, Jaccard
and Sørensen methods respectively. Each number is obtained by averaging over 100
implementations with independent random divisions into training set and probe set.
The probe set contains 5% of the observed links. Here L = 100, 500, 1000, 5000, 10000,
20000.
3.3. Results for Cold Probe Set

We employ the new partition method to prepare the probe set for experiments. Here
we set $D = 10$. Thus, we obtain ten different probe sets $E_i^P$ ($i = 1, \ldots, 10$). Clearly,
each probe set contains 5% of the observed links, and $\langle pop \rangle_1 < \langle pop \rangle_2 < \cdots < \langle pop \rangle_{10}$.
The algorithmic performances on the C.elegans network for different probe sets are shown
in Table 4. The results for the other three networks are similar.

Compared with the other nine indices, LHN has the best performance in predicting
the very unpopular links (the links in $E_1^P$ and $E_2^P$), while it has the worst performance on
the links in the probe sets with $P > 5$ (where $P = i$ denotes the probe set $E_i^P$), especially the popular links in $E_8^P$, $E_9^P$ and $E_{10}^P$.
On the contrary, the PA index gives very good predictions on the popular links, but extremely
bad predictions on the links with low-degree nodes, where the accuracy is even much
lower than in the random case. In the middle region, where the average popularity is close
to that of the randomly selected probe set, the RA index outperforms the others, which is in
accordance with the conclusions of previous studies [14, 36, 37].
Table 4: Algorithmic accuracies on C.elegans for different probe sets, measured by the
AUC value. Sal, Jac and Sør are the abbreviations for the Salton, Jaccard and Sørensen
methods respectively. Each value is obtained by averaging over 100 implementations
with independent divisions into training set and probe set using the new partitioning
method. The average popularities of these ten probe sets are shown in brackets.
The average popularity of the whole set is 523. The highest AUC value in each row is
emphasized in black.
Figure 2: The dependence of AUC on $\lambda$ and $P$. These results are obtained
by averaging over 100 implementations with independent divisions into training set and
probe set (by the new partitioning method introduced in Sec. 2.3 and Sec. 3.3).
Here we set $D = 10$.
4. Improved Index

To design a method for effectively predicting both popular and unpopular links, we
propose a parameter-dependent index, which is defined as

$$s_{xy} = \frac{|\Gamma_x \cap \Gamma_y|}{(k_x \times k_y)^{\lambda}}, \qquad (2)$$

where $\Gamma_x$ is the set of neighbors of node $x$, $k_x$ is its degree, and $\lambda$ is a free parameter. This index is also neighborhood-based and requires only
the information of the nearest neighbors, and thus no extra computational complexity
arises. Clearly, when $\lambda = 0$, this index reduces to CN, and for $\lambda = 0.5$
and $\lambda = 1$, it reduces to the Salton index and the LHN index, respectively. Given a
network, one can tune $\lambda$ to find the optimal value that yields the highest accuracy.
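
A direct transcription of Eq. (2) might look like the sketch below. The function name `improved_index`, the neighbor-set dictionary `adj` and the early return for pairs without common neighbors are illustrative assumptions; the three special cases in the usage note follow from the definition.

```python
def improved_index(adj, x, y, lam):
    """Score of Eq. (2): |common neighbours| / (k_x * k_y) ** lam.

    adj maps each node to the set of its neighbours; lam is the free parameter.
    """
    cn = len(adj[x] & adj[y])
    if cn == 0:
        # No common neighbour: the score is 0 for any lam. Returning early also
        # avoids raising 0 to a negative power when a node is isolated.
        return 0.0
    return cn / (len(adj[x]) * len(adj[y])) ** lam
```

For a pair with two common neighbors and degrees $k_x = k_y = 2$, the score is 2 at `lam=0` (CN), 1 at `lam=0.5` (Salton) and 0.5 at `lam=1` (LHN). Negative values of `lam` multiply by the degree product instead of dividing by it, which favors popular links.
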

We apply the new index to predict the links in each $E_i^P$ ($i = 1, \ldots, 10$).
The results for the four example networks in the $(\lambda, P)$ plane are shown in Fig. 2, where we
focus on the unpopular links ($i = 1, \ldots, 5$) and $P = i$ means that $E_i^P$ is employed as
the probe set. The results show that the optimal $\lambda$ is positive when $P$ is small (i.e.,
$P = 1, 2$), while it becomes negative for large $P$. For NetScience, $\lambda^*$ becomes negative
at $P = 6$. The dependence of the optimal parameter $\lambda^*$ on $P$ is shown in Fig. 3. Overall,
the optimal parameters $\lambda^*$ of the four networks are negatively correlated with $P$.
In other words, the index with higher (positive) $\lambda$ gives better predictions on unpopular
links, while the index with lower (negative) $\lambda$ is good at predicting popular links. For
example, in the C.elegans network, when $P = 1$, namely when the probe set consists of
unpopular links, the optimal $\lambda^* = 2.2$, indicating that depressing the scores of popular
links is the better choice, while for $P = 10$, namely when the probe set consists of
popular links, the optimal $\lambda^* \approx -3$, which indicates that we had better assign higher
scores to the popular links.




Figure 3: The dependence of the optimal parameter $\lambda^*$ on $P$. The probe set contains 5%
of the known links. Each value is obtained by averaging the results of 100 independent
implementations.



The algorithmic accuracies of the ten local similarity indices as well as the proposed
index for predicting unpopular links in $E_1^P$ (i.e., $P = 1$) are shown in Table 5. Among
the ten local similarity indices investigated, LHN outperforms the others in predicting
unpopular links. However, the proposed index can further improve the accuracy with a proper
$\lambda$, which is positive for all four networks. In particular, the improvements on the
PB and C.elegans networks are significant, 10.8% and 8.6% respectively compared with
LHN.

Table 5: Comparison of algorithmic accuracy (AUC) on the cold probe set ($P = 1$). NS,
CE and PB are the abbreviations for the NetScience, C.elegans and Political Blogs networks
respectively. Sal, Jac and Sør are short for the Salton, Jaccard and Sørensen methods. For
each network, the probe set contains 5% of the known links. Each value is obtained by
averaging over 100 independent implementations. The entries corresponding to the
highest accuracies are emphasized in black. For the proposed index, the AUC values
correspond to the optimal parameters, which are shown in brackets.



Datasets  CN     Sal    Jac    Sør    HPI    HDI    LHN    PA     AA     RA     New
PB        0.649  0.674  0.664  0.664  0.690  0.662  0.701  0.584  0.656  0.664  0.777 (2.6)
USAir     0.742  0.888  0.881  0.880  0.831  0.875  0.903  0.370  0.792  0.818  0.928 (1.2)
NS        0.973  0.991  0.991  0.991  0.988  0.990  0.992  0.095  0.980  0.981  0.993 (0.8)
CE        0.615  0.724  0.723  0.723  0.713  0.709  0.756  0.247  0.633  0.654  0.821 (2.2)



5. Experiments on Real Sampling Methods 



Table 6: Average popularities of missing links corresponding to different sampling
methods. 80% and 90% denote the proportion $|E^T|/|E|$.

To connect our study to real sampled networks, in this section we test the
improved index on some real sampling methods. First, we introduce four mainstream
sampling methods as follows.

The first one is called snowball sampling (also known as spider sampling or breadth-first
sampling, see Ref. [38]), which is a non-probability technique widely used in
studies of the World Wide Web and large-scale social networks. At the beginning of this
method, we randomly select one or a few nodes to form the initial sampled set,
and then we crawl all the neighbors of the nodes in the sampled set and put them into
the sampled set. This process continues until the required number of nodes has been sampled
out. Obviously, it is not relevant to the link prediction problem, because this method
leaves only missing nodes rather than missing links.

Figure 4: Algorithmic performance of the improved index (see Eq. 2) for the three real
sampling methods with different values of $\lambda$. The training set contains 90% of the observed
links.

The second one, called acquaintance sampling, is motivated by epidemic
immunization with a lack of information [39]. In this method, at each time step, a
random link of a randomly selected node is sampled out (i.e., put into the training
set), until the required number of links has been selected. Consider a link
$(x, y)$: if it is not yet sampled out, the probability that it will be selected at this time step
is $\frac{1}{N}\left(\frac{1}{k_x} + \frac{1}{k_y}\right)$, where $N = |V|$ is the number of nodes. Although a link with lower popularity does not necessarily have a high
$\frac{1}{k_x} + \frac{1}{k_y}$, statistically speaking, the probability $\frac{1}{N}\left(\frac{1}{k_x} + \frac{1}{k_y}\right)$ is negatively correlated with
the popularity $(k_x - 1)(k_y - 1)$. To our knowledge, this is a very special method,
in which unpopular links are more likely to be sampled out while the popular links constitute
the probe set.
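
Acquaintance sampling as described above can be simulated with a short loop. The sketch below is illustrative only: the function name `acquaintance_sampling`, the neighbor-set dictionary `adj`, the use of sortable node labels (for reproducible random choices) and the decision to skip isolated nodes are assumptions, not details given in the text.

```python
import random


def acquaintance_sampling(adj, n_training, seed=0):
    """Return (training, probe) link sets produced by acquaintance sampling.

    adj        -- dict mapping each node to the set of its neighbours
    n_training -- number of distinct links to put into the training set
    """
    rng = random.Random(seed)
    # Assumption: isolated nodes are skipped, since they have no link to sample.
    nodes = sorted(x for x in adj if adj[x])
    all_links = {(min(x, y), max(x, y)) for x in adj for y in adj[x]}
    if n_training > len(all_links):
        raise ValueError("n_training exceeds the number of links")
    training = set()
    while len(training) < n_training:
        x = rng.choice(nodes)           # a randomly selected node ...
        y = rng.choice(sorted(adj[x]))  # ... and one of its links, chosen at random
        training.add((min(x, y), max(x, y)))  # re-drawing a sampled link changes nothing
    return training, all_links - training
```

Because a link of a low-degree node is drawn with probability $1/k_x$ once that node is selected, links with cold ends tend to enter the training set early, leaving the popular links in the probe set.
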

The third one is named random-walk sampling [40]. A simple way adopted is as
follows: (i) initialize a particle on a randomly selected node; (ii) this particle jumps to a
randomly selected neighbor and the corresponding link is added to the training
set (i.e., sampled out); (iii) repeat (ii) until a certain number of links have been sampled
out, and the remaining links compose the probe set. It is well known that the distribution of the
visiting frequency of a random walker on a connected network soon converges to a
stationary distribution proportional to node degree, namely the probability that at a certain time step the random walker
is located at node $x$, say $\psi(x)$, is equal to $\frac{k_x}{2M}$, where $2M = \sum_y k_y$ serves as a normalization
factor. Consider a link $(x, y)$: if it is not yet sampled out, the probability that it will be
selected at this time step is $\psi(x)\frac{1}{k_x} + \psi(y)\frac{1}{k_y} = \frac{1}{M}$. That is to say, the average popularity
of links in the probe set is approximately the same as that from random sampling
(we have checked it by simulation). However, random-walk sampling is not the
same as random sampling: for example, the sampled network from the former is always
connected, while the one from the latter may contain several components.
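
Steps (i)-(iii) of random-walk sampling translate almost literally into code. In this sketch the function name `random_walk_sampling`, the neighbor-set dictionary `adj`, the sortable node labels and the `max_steps` safety bound are assumptions added for illustration; the network is assumed to be connected, as in the text.

```python
import random


def random_walk_sampling(adj, n_training, seed=0, max_steps=1_000_000):
    """Return (training, probe) link sets produced by a simple random walk.

    adj        -- dict mapping each node to the set of its neighbours
    n_training -- number of distinct links to put into the training set
    max_steps  -- hypothetical safety bound against walks that cannot finish
    """
    rng = random.Random(seed)
    all_links = {(min(x, y), max(x, y)) for x in adj for y in adj[x]}
    if n_training > len(all_links):
        raise ValueError("n_training exceeds the number of links")
    # (i) start the particle on a randomly selected (non-isolated) node
    current = rng.choice(sorted(x for x in adj if adj[x]))
    training = set()
    steps = 0
    while len(training) < n_training:
        if steps >= max_steps:
            raise RuntimeError("walk stopped early; is the network connected?")
        # (ii) jump to a random neighbour and sample the traversed link
        nxt = rng.choice(sorted(adj[current]))
        training.add((min(current, nxt), max(current, nxt)))
        current = nxt
        steps += 1
    # (iii) the links never traversed compose the probe set
    return training, all_links - training
```

Since every sampled link is traversed by the same walker, the training links always form a connected subgraph, which is the difference from uniform random sampling noted above.
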

The last one is called path-based sampling, which has been applied in extracting the
topology of the Internet at the router level (http://www.routerviews.org). Indeed, this method
tracks the transmission of information packets in the Internet, and a link passed by
more packets has a higher probability of being sampled out. To simulate this process, at
each time step, we randomly select a starting point and an end point, and we assume that a
packet goes from the starting point to the end point through a randomly selected shortest
path. After a sufficiently large number of time steps, a link with more than a threshold number,
$N_t$, of packets is put into the training set, while the others compose the probe set. Here
for simplicity, we set $N_t = 1$. Under this method, the links with high betweenness
centrality (betweenness centrality quantifies the traffic load of a link, depending on
the routing strategy of packet transmission [41]) are favored. Since the popularity of
a link is strongly positively correlated with its betweenness centrality under shortest-path
routing, the average popularity of links in the probe set is lower than that under random
sampling.
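
The packet simulation behind path-based sampling can be sketched as follows. Several details are assumptions made for illustration: the function names, the neighbor-set dictionary `adj`, drawing a shortest path uniformly at random by counting shortest paths during a breadth-first search, and reading "more than a threshold" as a strict inequality on the packet count.

```python
import random
from collections import Counter, deque


def random_shortest_path(adj, source, target, rng):
    """Return one shortest path chosen uniformly at random, or None if unreachable."""
    dist = {source: 0}
    sigma = {source: 1}  # number of shortest paths from source to each node
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                sigma[v] = 0
                queue.append(v)
            if dist[v] == dist[u] + 1:
                sigma[v] += sigma[u]
    if target not in dist:
        return None
    # Walk back from the target, choosing each predecessor with probability
    # proportional to the number of shortest paths that pass through it.
    path = [target]
    while path[-1] != source:
        v = path[-1]
        preds = sorted(u for u in adj[v] if dist[u] == dist[v] - 1)
        path.append(rng.choices(preds, weights=[sigma[u] for u in preds])[0])
    return path[::-1]


def path_based_sampling(adj, n_packets, threshold=1, seed=0):
    """Return (training, probe): links carrying more than `threshold` packets are sampled."""
    rng = random.Random(seed)
    nodes = sorted(adj)
    all_links = {(min(x, y), max(x, y)) for x in adj for y in adj[x]}
    load = Counter()
    for _ in range(n_packets):
        s, t = rng.sample(nodes, 2)  # random starting point and end point
        path = random_shortest_path(adj, s, t, rng)
        if path is None:
            continue
        for a, b in zip(path, path[1:]):
            load[(min(a, b), max(a, b))] += 1
    training = {e for e in all_links if load[e] > threshold}
    return training, all_links - training
```

Links that lie on many shortest paths accumulate packets quickly, so the links left in the probe set are mostly those with low betweenness, in line with the argument above.
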
Figure 5: The optimal values of $\lambda$ for the three sampling methods. The dashed line denotes
$\lambda =$ as a guide to the eye.



The average popularities of missing links corresponding to different sampling
methods are shown in Table 6. In agreement with our analysis, the average popularities
of the links in the probe set obey the inequality
$\langle pop \rangle_{\mathrm{acquaintance}} > \langle pop \rangle_{\mathrm{random\text{-}walk}} > \langle pop \rangle_{\mathrm{path\text{-}based}}$.
Figure 4 reports the algorithmic performance (measured by the AUC
value) for different sampling methods and different $\lambda$. Very clearly, when aiming to predict
popular links (e.g., under acquaintance sampling) $\lambda^*$ is negative, while when predicting cold
links $\lambda^*$ is positive. In fact, there is a strongly negative correlation between the average
popularity of links in the probe set and the optimal value of $\lambda$, as shown in Figure
5, where we order the three sampling methods by decreasing average popularity of
missing links, so that a positive correlation is observed.

6. Conclusion and discussion 

To our knowledge, in previous studies on link prediction [1], the data sets are always
divided in a random manner. Inspired by a closer look at the features of
missing links, this article challenges such a straightforward method. Applying a simple
measure of link popularity, we propose a method to sample less popular links for the probe
set. Experimental analysis shows a surprising result: the LHN index performs
the best, although it was known to be one of the worst indices when the probe set is a
random sample of all links. We propose a similarity index with a free parameter
$\lambda$; by tuning $\lambda$, this index reduces to the Common Neighbors index, the
Salton index and the LHN index. The optimal value of $\lambda$ depends monotonically on the
average link popularity of the probe set. We further test this index on three real sampling
methods. In agreement with the main results from theoretical analysis, the optimal value of
$\lambda$ increases as the average popularity of links in the probe set decreases. Again,
the improved index in a well-tuned range can outperform the others under real sampling
methods.

Note that the main contribution of this article does not lie in the proposed
index. Instead, the significance of this work is to raise the serious question of how to
properly determine the probe set. To us, this is a very important yet completely ignored
problem in information filtering. Reconsidering dataset division will largely
change the understanding, and thus the design, of algorithms in information filtering
(also relevant to the so-called recommendation problem [12]). As a starting point, we give
a naive method and a preliminary analysis, which is of course far from a satisfactory answer
to the question. In fact, we think an in-depth understanding of real sampling methods may
shed light on this issue.